Downstream Analysis of Synteny Networks with MSA2dist

Kristian K Ullrich

Max Planck Institute for Evolutionary Biology

Background - DNA Evolution

A matter of distance.

Background - Coding Sequences

Synonymous and NonSynonymous Substitutions: nucleotide substitutions might lead to changes on amino acid level.

Typically used to:

  1. Detect and Date whole-genome duplication (WGD) events;
  2. Predict selective pressure acting on a protein;
  3. Predict selective pressure acting on a codon;
  4. Investigate codon usage (R/Bioconductor package coRdon).

MSA2dist

An R/Bioconductor package to calculate pairwise distances between all sequences of a MultipleSequenceAlignment (DNAStringSet or AAStringSet) and to conduct codon based analysis.

MSA2dist Features:

  1. Calculates pairwise AA, DNA and IUPAC distances;
  2. Calculates Synonymous and NonSynonymous Substitutions (dN/dS) KaKs_Calcultor 2.0 models (C++ ported to R with Rcpp);
  3. Use pre-calculated MSA or conduct all possible pairwise alignments prior dN/dS calculation;
  4. Calculate average behavior of each codon from MSA.

MSA2dist workflow

Installation

From Bioconductor:

BiocManager::install(version='devel')
BiocManager::install("MSA2dist")


From GitHub:

remotes::install_github("kullrich/MSA2dist")

Example data set

Data: Whole-genome coding sequences and gene ranges for Chlorophyta algae + Physcomitrium patens as an outgroup.

Source: Pico-PLAZA 3.0

Where to find: the data/ directory in the GitHub repo associated with this presentation.

data
├── clusters.rda
├── codingsequences.rda
└── data_acquisition_cds.md
  1. codingsequences.rda: List of DNAStringSet objects.
  2. clusters.rda: Pre-calculated syntenet clusters.

Example data set: codingsequences

load(here::here("data", "codingsequences.rda"))

# Inspect data
names(codingsequences)
 [1] "Aprotothecoides"   "Helicosporidiumsp" "Chlorellasp"      
 [4] "PicRCC4223"        "PicSE3"            "Asterochlorissp"  
 [7] "Csubellipsoidea"   "Creinhardtii"      "Vcarteri"         
[10] "Bprasinos"         "Otauri"            "Osp"              
[13] "Olucimarinus"      "Omediterraneus"    "Msp"              
[16] "Mpusilla"          "Ppatens"          
head(codingsequences$Aprotothecoides)
DNAStringSet object of length 6:
    width seq                                               names               
[1]   531 ATGGAGGGTGTGCACAGGCAGCT...CGGGCACTGCCACCCCCATCTGA AP00G00010
[2]  1542 ATGCTCGCAGTCCTGGCTCGCCG...TGCGCCGCAGGCTGCTGGGCTGA AP00G00020
[3]  1248 ATGGAGGCAGAGACGGCCCAGCC...GCAGCTGGAAGGCGAAGTCATAG AP00G00030
[4]  1086 ATGGAGGCTCTCTCCGGCACCCG...TGGCCGGCGCGAGCTCGGTGTGA AP00G00040
[5]   930 ATGAGCTTCGTGACGGTGGGAGG...GGAGGCTCTACCCCGGCATGTGA AP00G00050
[6]   831 ATGGCCAACCTTTTTATCAACCT...TGCCAGAGGAGATCATCCTGTGA AP00G00060

Example data set: clusters

load(here::here("data", "clusters.rda"))

# Inspect data
head(clusters)
                   Gene Cluster
1 Chlo_CNC64A_028G00030       1
2 Chlo_CNC64A_028G00040       2
3 Chlo_CNC64A_028G00070       3
4 Chlo_CNC64A_028G00080       4
5 Chlo_CNC64A_028G00110       5
6 Chlo_CNC64A_028G00140       6
length(table(clusters$Cluster))
[1] 22605

Fetch coding sequences for a specific cluster

library(MSA2dist)
library(dplyr)
library(stringr)

# Get random cluster of size 10
my_cluster <- clusters %>% dplyr::filter(
    Cluster==sample(which(table(clusters$Cluster)==10), 1))
head(my_cluster)
                      Gene Cluster
1 Bpra_BPRRCC1105_15G00170   15748
2     Osp_ORCC809_19G00990   15748
3          Oluc_OL19G00210   15748
4          Oluc_OL19G00220   15748
5         Omed_OM_19G00230   15748
6         Omed_OM_19G00240   15748

Fetch coding sequences for a specific cluster

# Get species unique identifier
ab_len <- syntenet:::species_id_length(codingsequences)
s_abbrev <- substr(names(codingsequences), 1, ab_len)

# Fetch coding sequences
my_cds <- do.call(c, apply(
    stringr::str_split_fixed(my_cluster$Gene, "_", 2), 1,
    function(x) {
        tmp<-codingsequences[s_abbrev==x[1]][[1]];
        tmp[names(tmp)==x[2]]}))
my_cds
DNAStringSet object of length 10:
     width seq                                              names               
 [1]  1563 ATGGCTACTTATTGCGTCGGCGA...TCCAATGCCGACGGAAATGTAA BPRRCC1105_15G00170
 [2]  3579 ATGCTGTGTGTCGGCGCCGTGTC...ACCCGTCGACGGCGCCGCGTAA ORCC809_19G00990
 [3]  3276 ATGGGCGAAAGACGCGCGACGAC...CGTCACGTTGGATAGTCACTGA OL19G00210
 [4]  1302 ATGGCTGTTGGTGATGACGTCGA...CGTCGCTCCGGAACCAGCGTAA OL19G00220
 [5]  3126 ATGCGCGACGCGACGCGCGATGC...CGACGACGACGACGACGCGTAA OM_19G00230
 [6]  1440 ATGTCTGCGGTGTATTGCGTGGA...CAAGCCGGTGCCGGTTGAATAA OM_19G00240
 [7]  1905 ATGAAGGGGCACGCGACCCTTGA...CATGGAGCCCTCCAAGGTTTAA MRCC299_10G01060
 [8]  1434 ATGGCCTCATGGTGCACCGCCGC...CTCCGCGTACAACAACGCATAA MP13G03670
 [9]  1389 ATGTCGCTCACCGAATCTGGCGC...CGTCGCCCCGGAACCGGCGTAA OT_20G00200
[10]  3273 ATGGGAAACGGTCCCGAGGTCGA...GGAGGATTTCTCAGTCGTATGA OT_20G00190

Coding sequence alignment

To calculate a coding sequence alignment for two sequences:

# Get coding sequence alignment
cds1_cds2_aln <- cds2codonaln(cds1 = my_cds[1], cds2 = my_cds[2], remove.gaps = FALSE)
cds1_cds2_aln
DNAStringSet object of length 2:
    width seq                                               names               
[1]  3585 ATG---------------GCTAC...TGCCGACGGAA------ATGTAA BPRRCC1105_15G00170
[2]  3585 ATGCTGTGTGTCGGCGCCGTGTC...AACCCGTCGACGGCGCCGCGTAA ORCC809_19G00990

To calculate dN/dS from this alignment:

# Calculate dN/dS for this alignment; model = Li
cds1_cds2_kaks <- dnastring2kaks(cds1_cds2_aln,
    model = "Li", threads = 1, isMSA = TRUE)
cds1_cds2_kaks
  Comp1 Comp2                seq1             seq2       ka       ks
1     1     2 BPRRCC1105_15G00170 ORCC809_19G00990 0.276828 1.725513
          vka       vks
1 0.000758775 0.1024983

Parallelized pairwise coding sequence alignments

To calculate all pairwise combinations:

start_time <- Sys.time()
# Calculate dN/dS for the whole cluster; model = YN
my_cds_kaks <- dnastring2kaks(my_cds,
    model = "YN", threads = 2, isMSA = FALSE)
end_time <- Sys.time()

head(my_cds_kaks)
         Comp1 Comp2       seq1                seq2 Method        Ka      Ks
result.1     1     2 OL13G02250          OL13G02260     YN  0.916497 3.64537
result.2     1     3 OL13G02250 BPRRCC1105_04G02240     YN  0.187875 3.57243
result.3     1     4 OL13G02250    ORCC809_13G00730     YN 0.0406041 2.06945
result.4     1     5 OL13G02250          OL21G00740     YN  0.916497 3.64537
result.5     1     6 OL13G02250         OM_08G02050     YN  0.061288 3.46937
result.6     1     7 OL13G02250    MRCC299_06G05020     YN  0.150086      99
              Ka/Ks P-Value(Fisher) Length S-Sites N-Sites Fold-Sites(0:2:4)
result.1   0.251414     4.97415e-11    612 129.087 482.913                NA
result.2  0.0525901     2.12165e-60    687 117.125 569.875                NA
result.3  0.0196207     1.05183e-59    690 124.832 565.168                NA
result.4   0.251414     4.97415e-11    612 129.087 482.913                NA
result.5  0.0176655     1.21514e-62    690 138.058 551.942                NA
result.6 0.00151602               0    687 100.966 586.034                NA
         Substitutions S-Substitutions N-Substitutions
result.1           361         107.055         253.945
result.2           204         109.458         94.5418
result.3           110         87.6938         22.3062
result.4           361         107.055         253.945
result.5           135         102.529         32.4711
result.6           193          113.34         79.6598
         Fold-S-Substitutions(0:2:4) Fold-N-Substitutions(0:2:4)
result.1                          NA                          NA
result.2                          NA                          NA
result.3                          NA                          NA
result.4                          NA                          NA
result.5                          NA                          NA
result.6                          NA                          NA
         Divergence-Time Substitution-Rate-Ratio(rTC:rAG:rTA:rCG:rTG:rCA/rCA)
result.1         1.49209                              1.44681:1.44681:1:1:1:1
result.2        0.764902                            0.795759:0.795759:1:1:1:1
result.3        0.407654                                2.7026:2.7026:1:1:1:1
result.4         1.49209                              1.44681:1.44681:1:1:1:1
result.5        0.743189                              1.66709:1.66709:1:1:1:1
result.6         14.6777                            0.848586:0.848586:1:1:1:1
                                    GC(1:2:3) ML-Score AICc Akaike-Weight Model
result.1 0.535276(0.483129:0.476994:0.645706)       NA   NA            NA    NA
result.2 0.384194(0.361502:0.336463:0.454617)       NA   NA            NA    NA
result.3  0.46732(0.401961:0.404139:0.595861)       NA   NA            NA    NA
result.4 0.535276(0.483129:0.476994:0.645706)       NA   NA            NA    NA
result.5 0.422089(0.394692:0.372432:0.499144)       NA   NA            NA    NA
result.6 0.477585(0.401302:0.392625:0.638829)       NA   NA            NA    NA

How long did it take?

end_time - start_time
Time difference of 7.145215 secs

Calculate average behavior of each codon

Here, a pre-calculated MSA is necessary:

# load example data
data(hiv, package="MSA2dist")

# calculate average behavior
hiv.xy <- hiv |> dnastring2codonmat() |> codonmat2xy()

Plot average behavior of each codon:

Here’s where you can find me: